Nature Biotechnology — Latest Matching Preprints

1

SpaceBio Knowledge Hub: A LiteratOmics Platform for Microgravity and Space Biology Research

Silva, J. C. F.; Vieira, A.; Chue Donahey, M. S.; Silva, S. M. d. C.; Veloso, T.; Lopes, A.; Sexson, N.; Barker, R.; Porterfield, D. M.; Silva, C. A.; Dias, R.

2026-07-14 scientific communication and education 10.64898/2026.07.13.737239 medRxiv

Top 0.1%

31.4%

Space biology literature is growing exponentially. Existing infrastructure has not kept pace with organizing, synthesizing, and disseminating this knowledge. We present SpaceBio Knowledge Hub (www.spacebio.space), an integrated digital ecosystem that combines artificial intelligence, real-time data integration, and open-access infrastructure to advance research, education, and collaboration in microgravity, space biology and space exploration. The platform applies AI-driven approaches including natural language processing, machine learning, and automated content generation to construct a semantic atlas of the field. The atlas reveals the hierarchical thematic organization underlying microgravity-induced biological responses, space mission infrastructure, planetary science, and astrobiology. As part of this effort, SpaceBio is moving toward the construction of a LiteratOmics framework for microgravity and space biology: a systematic, AI-enabled approach to mining, integrating, and structuring the primary literature generated by omics-driven spaceflight research, treating the scientific literature itself as a navigable data layer alongside genomic, transcriptomic, and proteomic datasets. Built on a scalable, cloud-based architecture with a user-centered interface, SpaceBio supports literature exploration, data integration, and knowledge discovery for researchers, educators, students, industry partners, and citizen scientists. The platform also functions as a community-building ecosystem. It integrates hands-on research initiatives, AI-generated educational content, pilot data science projects, and social responsibility programs that broaden participation without compromising scientific rigor. AI-enabled digital environments can transform fragmented literature into a navigable knowledge landscape. SpaceBio accelerates research productivity, strengthens STEM education, and supports the global space life sciences community as human space exploration enters its most ambitious era.

2

ARCHIVE: Machine-Guided Design of an Efficient Open-Ended DNA Recording Device to Increase Resolution of Multiplexed Cell History Tracking

Rosenstein, A. H.; Garton, M.

2026-07-13 synthetic biology 10.64898/2026.07.10.737758 medRxiv

Top 0.1%

18.4%

Engineering cell-based devices to record events into DNA has potential both as a non-ablative research tool and in the clinic for enacting gene-circuit-based logic of cell therapies conditional on cell history. Whether as a means of understanding interactions on the single-cell level, or reconstructing histories of cellular events, a cellular DNA recording device has widespread utility, with prime editing-based methods at the forefront of this endeavor - notably peCHYRON. Yet, the resolution of such open-ended recording tools is inherently constrained by edit insertion efficiency, and these tools cannot yet capture RNA-polymerase II-transcribed signals, which represent a large segment of functionally-defined endogenous gene-regulatory architectures. Here we present ARCHIVE (Amplified Recording of Cellular Histories into Information-dense Vectors of Events), capable of integrating RNA-encoded signals into a predefined genomic recording locus with high efficiency. By utilizing deep-learning assisted prediction of prime-editing efficiency as a surrogate fitness model for generative in silico pegRNA evolution, we developed a recording device with an order-of-magnitude improvement in temporal resolution (efficiency of iterative message integration steps) compared to the state of the art - a capability we establish here at the level of constitutive promoter tracking. We expect ARCHIVE to serve as a launching point for more advanced mammalian synthetic-biology recording devices for both functional genomics and therapeutics research.

3

NinjaSeq: programmable restriction enzyme-based sequencing library preparation with random access for DNA data storage

Galminas, I.; Sabary, O.; Abraham, H.; Kaminskaite, K.; Cohen, T.; Gruodyte, V.; Alzbutas, G.; Yakhini, Z.; Palepsiene, R.; Zemaitis, L.; Yaakobi, E.; Juzenas, S.

2026-07-08 molecular biology 10.64898/2026.06.11.730843 medRxiv

Top 0.1%

16.2%

DNA data storage allows sequences to be defined without biological constraints, yet readout workflows still depend on generic end-repair/dA-tailing chemistry. We developed NinjaSeq, a type IIS restriction endonuclease library-preparation strategy that incorporates recognition sites into primer flanks, enabling digestion to generate adapter-compatible overhangs and eliminating the need for conventional end preparation. By combining this chemistry with constrained coding that excludes internal recognition motifs, NinjaSeq produced sequencing quality and decoding performance consistent with standard protocols while reducing reagent burden and simplifying processing, including compatibility with one-pot restriction-ligation. The same sequence-directed design also enables physical random access during library preparation: targeting file-specific flanking sites enriched a desired file from a mixed pool by about sixteen-fold in a proof-of-concept experiment. These results position NinjaSeq as a practical ONT readout approach for DNA data storage.

Highlights:
- NinjaSeq replaces end-repair/dA-tailing with REases for nanopore sequencing
- Constrained encoding excludes recognition motifs to protect payloads from cleavage
- NinjaSeq achieves decoding accuracy comparable to standard library preparation
- Designing file-specific RRS enables random access during library preparation
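
The constrained-coding idea in this abstract (payloads must not contain the enzyme's recognition motif) can be illustrated with a minimal sketch. The abstract does not name the enzyme or the encoder, so the BsaI site GGTCTC is used here purely as an assumed example, and all function names are hypothetical.

```python
# Illustrative sketch only. The motif (BsaI, GGTCTC) is an assumption;
# the preprint abstract does not state which type IIS enzyme is used.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def contains_site(payload: str, motif: str = "GGTCTC") -> bool:
    """True if the motif occurs on either strand of the payload."""
    payload = payload.upper()
    return motif in payload or reverse_complement(motif) in payload


def filter_payloads(payloads, motif: str = "GGTCTC"):
    """Keep only payloads that the enzyme would not cut internally."""
    return [p for p in payloads if not contains_site(p, motif)]
```

A real encoder would avoid generating such payloads in the first place rather than discarding them afterwards; the check above only shows what the constraint means.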

4

Lossless compression of k-mer matrices enabling random row access

Regnier, A.; Lemane, T.; Bellenous, S.; Chikhi, R.; Peterlongo, P.

2026-07-08 bioinformatics 10.64898/2026.07.03.736306 medRxiv

Top 0.1%

14.7%

Genomic search engines such as Logan-Search index petabytes of sequencing data as large binary matrices, called k-mer matrices, where each row encodes the presence of a k-mer across thousands to millions of genomic samples. Logan-Search contains a petabyte of binary matrices, and storing them is expensive, yet compression must not prevent fast random access to any matrix row at query time. We present kmcomp, a lossless compression method for k-mer matrices that satisfies these competing requirements. Block compression partitions the matrix into fixed-size row blocks, each compressed independently; block start positions are stored in an Elias-Fano encoded array, enabling O(1) random access to any block. To improve compressibility without introducing additional decompression steps, we introduce the π-compression: a column reordering that groups similar samples together by solving the Traveling Salesman Problem via a nearest-neighbor heuristic. We accelerate this heuristic with a novel variant of the vantage-point tree, the masked vp-tree, which dynamically prunes nearest-neighbor search space. On three (meta)genomic datasets, kmcomp achieves compression ratios of 1.3 to 5.4; π-compression further improves these to 1.5 to 51.3. Applied to the Logan-Search petabyte-scale index, compression reduces storage by approximately half, and π-compression adds a further 13% gain. Query overhead remains modest: queries of hundreds of nucleotides incur an absolute latency increase of approximately 100 ms, and highly compressed indexes can match uncompressed query times thanks to reduced disk reads.
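
The two ideas in this abstract, independently compressed row blocks with an offset array for random access and a greedy nearest-neighbor column ordering, can be sketched as follows. This is not kmcomp's implementation: zlib stands in for the actual codec, a plain Python list stands in for the Elias-Fano array, the nearest-neighbor search is brute force rather than a masked vp-tree, and all function names are hypothetical.

```python
import zlib


def pack_row(bits):
    """Pack a list of 0/1 values into bytes, most significant bit first."""
    out = bytearray((len(bits) + 7) // 8)
    for i, bit in enumerate(bits):
        if bit:
            out[i // 8] |= 0x80 >> (i % 8)
    return bytes(out)


def unpack_row(data, n_cols):
    """Inverse of pack_row for a row with n_cols columns."""
    return [(data[i // 8] >> (7 - i % 8)) & 1 for i in range(n_cols)]


def compress_blocks(rows, rows_per_block):
    """Compress fixed-size row blocks independently.

    Returns (blob, starts), where starts[b] is the byte offset of block b
    (a plain list here; an assumption standing in for Elias-Fano)."""
    blob, starts = bytearray(), [0]
    for r in range(0, len(rows), rows_per_block):
        raw = b"".join(pack_row(row) for row in rows[r:r + rows_per_block])
        blob += zlib.compress(raw)
        starts.append(len(blob))
    return bytes(blob), starts


def get_row(blob, starts, row, rows_per_block, n_cols):
    """Random row access: decompress only the block holding the row."""
    block = row // rows_per_block
    raw = zlib.decompress(blob[starts[block]:starts[block + 1]])
    width = (n_cols + 7) // 8
    offset = (row % rows_per_block) * width
    return unpack_row(raw[offset:offset + width], n_cols)


def nearest_neighbor_order(rows):
    """Greedy nearest-neighbor tour over columns by Hamming distance."""
    n_cols = len(rows[0])
    cols = [[row[j] for row in rows] for j in range(n_cols)]

    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    order, remaining = [0], set(range(1, n_cols))
    while remaining:
        last = cols[order[-1]]
        # Ties are broken by column index to keep the order deterministic.
        nxt = min(remaining, key=lambda j: (hamming(cols[j], last), j))
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

Because the column order is a permutation, applying it before compression stays lossless as long as the permutation is stored alongside the compressed blocks.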

5

A Simple, Cost-Effective, High-Throughput Method for Measuring Chromatin Accessibility and Gene Expression in Single Nuclei

Luo, Z.; Greenleaf, W. J.

2026-06-30 genomics 10.64898/2026.06.29.735326 medRxiv

Top 0.1%

13.0%

We describe microfluidic-free, droplet-based methods for single-nucleus epigenomic measurements: Particle-templated Instant Partition single-nucleus assay for transposase-accessible chromatin using sequencing (PIP-ATAC-seq) and its multiomic version (PIP-Multiome-seq). We benchmarked these assays by generating data sets containing thousands of nuclei using cell lines and mouse brains and compared to other established methods. PIP-Multiome and PIP-ATAC are straightforward to implement, affordable, and produce high-quality data, providing useful additions to the single-cell molecular measurement armamentarium.

6

Semantic fragment representations for coordinate-free analysis of genomics data

Heydari, H.; Zhao, J.; Arseneault, M.; Younesian, L.; Tanguay, S.; Riazalhosseini, Y.; Goodarzi, H.; Najafabadi, H. S.

2026-07-10 genomics 10.64898/2026.07.09.737627 medRxiv

Top 0.1%

12.8%

Many genomic assays begin with individual DNA fragments, but standard analysis quickly collapses those molecules into counts over genomic intervals. Rich information carried by each fragment, including its sequence, fragment body, cleavage boundaries, and local flanking context, is lost in this process. This loss is especially apparent in mixed-source and heterogeneous samples, where individual fragments originate from disparate cell types and can retain information about their cell of origin. To address this, we present LEAF-1, a fragment-level foundation model pre-trained on approximately 58 billion fragments spanning bulk ATAC-seq, single-cell ATAC-seq, and cell-free DNA profiles, representing each DNA molecule as a point in a learned semantic space defined by sequence context, assay modality, and explicit cleavage-boundary tokens. In sparse scATAC-seq datasets, mean-pooled LEAF-1 embeddings readily classify human cell types from as few as ~1,000 fragments per cell, with high-scoring fragments linked to cell-type-associated transcription-factor programs. Similarly, in cell-free DNA profiling, LEAF-1 outperformed state-of-the-art coordinate-binning strategies and general-purpose DNA language model baselines across cancer detection tasks. Applying attention-based multiple-instance learning to LEAF-1 embeddings further improved cancer detection, reaching an area under the receiver operating characteristic (ROC) curve (AUC) of 0.95. This pan-cancer model generalizes beyond cancer types it is trained on, as we show by profiling plasma samples from clear cell renal cell carcinoma patients and healthy volunteers and applying the frozen classifier without retraining, achieving an AUC of 0.83. These results show that semantic learning over individual DNA fragments preserves biochemical, cell-associated, and disease-associated signals that are otherwise lost during coordinate-based aggregation.

7

High-fidelity rare structural variant detection with HiFiRE3 reduced representation via restriction enzyme ends

Stewart, J. A.; Mishler, J.; Ahmed, S.; Schwer, B.; Glover, T. W.; Wilson, T. E.

2026-06-29 genomics 10.64898/2026.06.24.734375 medRxiv

Top 0.1%

12.8%

High-fidelity detection of rare structural variants (SVs) remains challenging because library preparation and sequencing techniques generate artifactual junctions that obscure true single-molecule events. Here, we present HiFiRe3, an error-minimized sequencing framework that combines artifact-aware library design with error suppression and correction strategies to enable rare SV detection and frequency assessment across long and short-read sequencing platforms. We first systematically characterized major classes of SV artifacts, including chimeric PCR products, intermolecular ligation, sequencing platform-specific artifacts, and mapping errors. HiFiRe3 supports error correction of these artifact junctions by combining reduced representation restriction fragments with pre-ligation size selection to enable computational filtering via independent forced restriction enzyme end (FREE) and <1N size logics. In nanopore libraries, these approaches enabled targeted detection of single-molecule SVs at replication stress hotspots in cultured human cells exposed to genotoxicants and in long genes in untreated mouse brains, while markedly reducing singleton translocation artifacts. HiFiRe3 extends to PacBio sequencing for joint SV and SNV error correction and to short-read platforms for cost-efficient high-fidelity nonhomologous SV analysis. Together, HiFiRe3 is a flexible framework for accurately detecting rare genomic structural variation with broad applicability to targeted and genome-wide studies by selective application of its error correction approaches.

8

NanoCellAnnotator: Formalizing Expert Cell Type Annotation with Large Language Models

Mahmud, M. I.; Kochat, V.; Anzum, H.; Satpati, S.; Dwarampudi, J. M. R.; Rai, K.; Banerjee, T.

2026-06-25 bioinformatics 10.64898/2026.06.21.728965 medRxiv

Top 0.1%

11.9%

Motivation: Cell-type annotation in spatial transcriptomics is challenging due to sparse gene panels, spatial heterogeneity, and limited availability of tissue-matched reference atlases. Recent approaches have explored large language models (LLMs) for integrating biological knowledge during annotation, but unconstrained inference can produce biologically unsupported predictions and hallucinated cell types. In addition, many LLM-based pipelines rely on large cloud-hosted models that limit reproducibility and deployment in privacy-sensitive environments. Results: We introduce NanoCellAnnotator, a biologically constrained and confidence-aware framework for automated cell-type annotation in spatial transcriptomics. The framework decouples spatial structure discovery, deterministic biological evidence construction, and language model-based semantic inference. Spatial clusters are identified using hybrid spatially regularized non-negative matrix factorization (hSNMF), after which cluster-level marker genes are abstracted into ontology-derived functional programs using Gene Ontology enrichment and GO-slim projection. A lightweight locally executable language model performs constrained label selection within a curated admissible label space derived from PanglaoDB and CellMarker. Annotation confidence is estimated independently using marker support strength and lineage separation, enabling ambiguous or heterogeneous clusters to be explicitly flagged. We evaluate NanoCellAnnotator on Xenium spatial transcriptomics data from intrahepatic cholangiocarcinoma and an independent breast cancer spatial transcriptomics dataset. The framework recovers canonical cell populations with high confidence while identifying heterogeneous or transitional spatial domains as ambiguous. Agreement with manual annotations was evaluated using accuracy and adjusted Rand index. Availability: Code available at https://github.com/ishtyaqmahmud/NanoCellAnnotator.

9

Intact and single-molecule analysis of heparan sulfate

Hristov, P.; Kakhaki, P. D.; Tzadikario, T.; Rai, S. K.; Su, G.; Olivieri, P. H.; Esko, J. D.; Liu, J.; Jain, M.; Flynn, R. A.

2026-06-29 biochemistry 10.64898/2026.06.26.734651 medRxiv

Top 0.1%

11.8%

Establishing tools to couple biological processes to a DNA sequence has transformed our ability to monitor life at the molecular scale due to the scalability, flexibility, and low cost of DNA sequencing. Key examples include DNA-protein (ChIP-seq¹), RNA-protein (CLIP-seq²), protein-protein (proximity ligation assay³), and Cas-based recording of cellular events⁴. In contrast, this paradigm has not yet significantly enhanced studies of glycans, which are mostly limited to non-DNA based chemical and biochemical assays. While classical asparagine-linked and serine/threonine-linked glycans can be directly sequenced using mass spectrometry, glycosaminoglycans - notable players in the extracellular matrix - cannot be easily analyzed in their full-length form. Here we introduce HS-nano-seq, a generalized framework to selectively label, process, and detect features of heparan sulfate on a nanopore sequencing platform. Recognizing that heparan sulfate is biochemically analogous to a nucleic acid, we report purification techniques using rapid nucleic acid strategies and conjugation methods to couple DNA adapters, generating HS-DNA chimeras resolved as discrete species by capillary electrophoresis (CE). The CE assay can distinguish features of chain length and sulfation patterns. At the single-molecule level enabled by nanopore sensing, we classify a library of synthetic heparan sulfate standards and demonstrate that nanopore ionic current fingerprints encode sulfation-dependent structural features of individual HS chains. Analysis of intact, cell-derived HS could discriminate features of individual chains with different sulfation patterns, defining the heterogeneity of binding motifs across cell types and how cells organize and program the tethered extracellular matrix. More broadly, HS-nano-seq establishes a framework for achieving full-length readouts of ECM glycopolymers that are amenable to the same biological interrogation as nucleic acids.

10

Computing tumor specificity of cancer antigen targets by k-mer indexing of healthy tissue transcriptomes

Hausmann, J.; Lang, F.; Muslu, O.; Kress, L.; Landry, J.; Suchan, M.; Nubbemeyer, A.; Kuner, R.; Weber, D.; Schrörs, B.; Schulz, M. H.; Gaida, M. M.; Sahin, U.; Ibn-Salem, J.

2026-07-03 bioinformatics 10.64898/2026.06.29.734488 medRxiv

Top 0.1%

11.8%

Individualized cancer immunotherapies rely on tumor-specific T-cell antigens, often predicted from somatic mutations as neoantigens. For tumors with low mutational burden, mRNA transcript variants, including gene fusions and novel splice junctions, can serve as important alternative targets. A main challenge in their identification from tumor RNA-seq is to confirm that their expression is tumor-restricted. Although large public collections of healthy-tissue RNA-seq exist, verifying tumor-specific expression requires computationally expensive re-analysis of these data for every novel candidate. To address this, we benchmarked nine k-mer indexing algorithms and developed k4neo, which leverages k-mer indexing of raw RNA-seq reads to compute the tumor specificity of any transcript variant. This mapping-free and transcript-class agnostic approach screens any candidate sequence against 18,960 samples across 51 healthy tissue types. We confirmed k4neo's detection accuracy with qRT-PCR and showed that k4neo accurately classifies somatic and germline variants, gene fusions, and isoforms by tumor specificity. Applied to nine tumor cohorts, it nominated a median of 4-80 tumor-specific splice junctions per patient, including recurrent, long-read-validated novel antigen candidates. Together, k4neo enables efficient access to large-scale sequencing cohorts and accurately computes tumor specificity for any input transcript sequence, thereby expanding the repertoire of individual and shared cancer antigen targets.
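
The general idea of mapping-free screening by k-mer lookup can be illustrated with a toy in-memory index. This is a sketch under stated assumptions, not k4neo's method: the abstract does not describe its index structure or scoring, so the dictionary index, the `min_fraction` threshold, and all function names are hypothetical, and reverse-complement strands are ignored for brevity.

```python
def kmers(seq, k):
    """Set of all k-mers in a sequence (uppercase, forward strand only)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(samples, k):
    """Map each k-mer to the set of healthy samples whose reads contain it.

    samples: dict of sample name -> list of read strings."""
    index = {}
    for name, reads in samples.items():
        for read in reads:
            for km in kmers(read, k):
                index.setdefault(km, set()).add(name)
    return index


def samples_supporting(candidate, index, k, min_fraction=1.0):
    """Healthy samples containing at least min_fraction of the candidate's k-mers."""
    query = kmers(candidate, k)
    if not query:
        return set()
    counts = {}
    for km in query:
        for name in index.get(km, ()):
            counts[name] = counts.get(name, 0) + 1
    return {n for n, c in counts.items() if c / len(query) >= min_fraction}
```

A candidate sequence supported by no healthy sample would be a tumor-specificity candidate in this toy model; a production index would need a compact on-disk structure to scale to thousands of samples.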

11

SSUplex: fast, both-strand extraction and origin-sorting of small-subunit rRNA for environmental DNA metabarcoding

O'Brien, A.; Vargas, J.; Acuna, I.; Parada, P.

2026-07-05 bioinformatics 10.64898/2026.07.02.736232 medRxiv

Top 0.1%

11.7%

Ribosomal RNA metabarcoding sits at the centre of how we characterise microbial and eukaryotic communities in environmental samples, and long-read sequencing has made full-length small-subunit (SSU; 16S/18S) profiling routine. The broadly conserved primers that make rRNA such a convenient marker are also its liability: by design they co-amplify organellar (mitochondrial, chloroplast) and cross-domain SSU alongside the intended target. Left unsorted before taxonomic assignment, these passengers are systematically misclassified, and the error propagates straight into estimates of community composition and diversity. Reads must therefore be detected, extracted, and sorted by origin before they ever reach a classifier. We present SSUplex, an open-source tool that detects SSU rRNA, assigns each read to one of five origins (bacteria, archaea, eukaryota, mitochondria, chloroplast), and extracts the SSU region for downstream classification. SSUplex reimplements the extraction-and-origin logic of the widely used Metaxa2 in the Rust programming language, scans both strands, and ships as a single dependency-light binary suited to long-read (Oxford Nanopore, PacBio HiFi) and short-read data. Benchmarked against Metaxa2 on public data, SSUplex reproduces Metaxa2 origin calls on full-length reads (96.8% concordance) and matches its extraction speed on small inputs, then pulls away to run up to approximately 3.4-fold faster with approximately 35% lower peak memory at 200,000 reads, the per-sample scale a long-read amplicon run typically reaches. We characterise a genuine, measured trade-off in the origin-ranking statistic, and we identify the bacteria-versus-mitochondria boundary as the method's one intrinsically lower-confidence edge. For the now-common workflow in which origin-sorted reads are handed to a dedicated classifier rather than classified in place, SSUplex is a fast, reproducible, embeddable stand-in for Metaxa2's extraction role. 
Source code and a benchmark harness that regenerates every result from public data are available under the MIT license at https://github.com/ayobi/ssuplex.

12

Programming T cells for intercellular genome editing

Wasko, K. M.; Maker, M.; Ngo, W.; Chen, K.; Ma, E.; Pattali, R.; Chen, E.; Leung, T.; Braverman, J.; Doudna, J. A.

2026-06-23 bioengineering 10.64898/2026.06.21.729417 medRxiv

Top 0.1%

11.5%

Therapeutic genome editing requires delivery of editing molecules to defined cell types, but targeting specificity and efficiency are currently limited. We hypothesized that properties inherent to immune cells, including tissue infiltration and programmed cell recognition, could be harnessed to engineer a cell-based delivery system. We show here that T cells can both produce and transfer editing machinery to target cells. In response to a programmable ligand, engineered T-lymphoid cells can transfer enzymes using complex spatiotemporal logic and deliver cargo in a cell contact-dependent or -independent manner. We demonstrate feasibility of this approach in primary human T cells, establishing a customizable genetic circuit for macromolecular delivery controlled by intercellular interactions.

13

EDTA v2: enabling scalable TE annotation in animal genomes

Ou, S.; Lu, T.; Nguyen, H.; Gerhardt, K.; Fang, N. F.; Rashid, U.; Guhlin, J.; Dainat, J.; Bao, Z.; Bayer, P. E.; Na, Y.; Benson, C.

2026-07-06 genomics 10.64898/2026.07.01.735963 medRxiv

Top 0.1%

11.5%

The Extensive de-novo TE Annotator (EDTA) automates transposable element annotation in plant genomes but lacks direct LINE/SINE detection, limiting its applicability to animal genomes. We present EDTA v2, which integrates LINE and SINE detection, completely rewrites TIR-Learner for deployability and scalability, and accelerates structural detectors by up to two orders of magnitude. Tested in 30 animal genomes from the Vertebrate Genomes Project Phase I, EDTA v2 bridges the non-LTR detection gap that has prevented automated TE annotation in animals.

14

Programmable CRISPRtune dissects the transcriptional repressive activity of MeCP2

Brim, J. I.; Ornelas, I. J.; Colias, P. J.; Divekar, N. S.; Xu, D.; Lubin, J. P.; Ferrel, S. I.; Galan Palma, L.; Hernandez Zamora, M. G.; Pattali, R. K.; McDaniel, J. J.; Chasins, S. E.; Nunez, J. K.

2026-07-10 genomics 10.64898/2026.07.08.737375 medRxiv

Top 0.1%

11.4%

The ability to control the expression of human genes is a major goal in synthetic biology, enables dissection of gene function, and can be harnessed for therapeutic applications. Advances in genome editing and transcriptional engineering often result in complete gene inactivation or full transcriptional repression. However, programmable tools to dial transcription at intermediate levels remain challenging. Here, we present CRISPRtune - a synthetic fusion of MeCP2 to catalytically dead dCas9 that tunes down transcription of endogenous genes in human cells by harnessing the mild repressor activity of MeCP2. Using pooled genome-scale CRISPR screens, we tune the expression of thousands of endogenous genes and define the targeting rules of CRISPRtune in human cells. With a platform to target MeCP2 at defined genomic sites, we show the direct epigenetic changes induced by MeCP2 at gene promoters and we identify its genetic dependency partners for productive transcriptional repression. Rett syndrome-associated mutations of MeCP2 show defects for transcriptional repression due to their failure to remodel the local epigenetic landscape of target genes. Together, we present a programmable method for transcriptional tuning in mammalian cells and offer an orthogonal platform to dissect the mechanistic function of chromatin regulators in living cells.

15

OCellus: A Language-Model Framework for Single-Cell, Spatial, and Perturbation Biology with Natural-Language Reasoning

Zhang, C.; Sun, J.; Xu, Z.; Liao, R.; Yin, A.; Gao, H.; Liu, E.; Bao, Y.; Zhao, L.; Wang, G.

2026-07-12 bioinformatics 10.64898/2026.07.08.737248 medRxiv

Top 0.2%

9.9%

Computational modeling of cellular behavior--the virtual cell--has emerged as a stated grand challenge at the intersection of artificial intelligence and biology, yet existing foundation models remain specialized: single-cell models process dissociated transcriptomes only, spatial models require dedicated spatial-aware architectures, and perturbation predictors depend on manually curated knowledge bases that cap generalization. Here we introduce OCellus, a single nine-billion-parameter language model (Qwen3.5-9B) fine-tuned on twenty-two biological tasks that simultaneously addresses all three limitations through three coordinated technical contributions on a shared backbone. First, EvenClock encodes two-dimensional spatial coordinates as eighteen clockface sectors of text, enabling spatial reasoning on a vanilla language model without architectural modification; on ten spatial transcriptomics tasks OCellus attains 77 percent spatial-neighborhood accuracy, 96 percent spatial-cellchat accuracy, and 0.70 proportion-cosine similarity on spatial deconvolution, all without any spatial-aware architectural components. Second, per-gene language-model embeddings replace the Gene Ontology annotations that GEARS depends on, achieving Pearson correlation 0.945 on the Replogle 2022 perturbation benchmark versus 0.84 for GEARS across 457 completely unseen knockout genes. Third, OCellus-Agent provides a Planner-Router-Verifier natural-language interface that achieves 75 percent pipeline accuracy on eighty multi-task queries. Removing language-model embeddings collapses perturbation Pearson to 0.06, confirming that learned functional representations--not graph topology--drive the gain. 
As a cell-type encoder, OCellus ranks first among fourteen foundation models in linear-probe accuracy at 95.1 percent across four benchmark datasets, and reaches 72.6 percent average across twenty-two evaluated biological tasks--a 57-percentage-point absolute gain over the strongest baseline configuration. As a language model, OCellus uniquely generates natural-language explanations of its predictions, a capability absent from all competing methods. Code, pre-trained model weights, the graph-neural-network module, and the agent system will be made available upon publication.
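
The abstract states only that EvenClock encodes two-dimensional coordinates as eighteen clockface sectors of text. One plausible reading, shown here as a hypothetical sketch rather than the authors' scheme, is angular binning of a neighbor's offset into eighteen equal 20-degree sectors. The sector numbering (clockwise from the 12 o'clock direction), the axis orientation, and the token format are all assumptions.

```python
import math

N_SECTORS = 18  # from the abstract; everything else below is an assumption


def clock_sector(dx: float, dy: float, n_sectors: int = N_SECTORS) -> int:
    """Bin the direction of an offset (dx, dy) into equal angular sectors.

    Sector 0 starts at the +y direction ("12 o'clock") and indices
    increase clockwise. Assumes a y-up coordinate system."""
    angle = math.degrees(math.atan2(dx, dy)) % 360.0
    return int(angle // (360.0 / n_sectors))


def sector_token(dx: float, dy: float) -> str:
    """Render the sector as a text token (format is hypothetical)."""
    return f"<sector_{clock_sector(dx, dy):02d}>"
```

For example, a neighbor directly to the right of a cell (dx=1, dy=0) lies at 90 degrees clockwise from 12 o'clock and falls in the sector covering 80 to 100 degrees.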

16

WattmaMod enables high-resolution and extensible RNA modification profiling for nanopore direct RNA sequencing

Han, R.; Yu, B.; Xinghui, S.; Xiao, L.; Junhai, Q.; Ting, Y.; Xin, G.

2026-07-02 bioinformatics 10.64898/2026.07.02.735990 medRxiv

Top 0.2%

9.8%

Nanopore direct RNA sequencing enables direct profiling of RNA modifications on native transcripts, but accurate multi-modification detection remains limited by non-stationary signals and heterogeneity across chemistries. Here, we develop WattmaMod, a deep learning framework for multi-modification detection from nanopore direct RNA sequencing data. It combines self-supervised pretraining, supervised contrastive fine-tuning, and low-label incremental adaptation to improve representation learning and support efficient extension to low-resource modification types. The framework further incorporates wavelet-guided multi-scale encoding and dynamic cross-attention fusion to model raw signals and event-level features. Results show that WattmaMod achieves robust detection of multiple RNA modifications, including m6A, m5C, m1A, A-to-I, m7G, hm5C, m1Ψ, f5C, ac4C, m5U and Ψ. It also extends efficiently to low-resource modification types with minimal labeled data, generalizes across sequencing chemistries and species, and predicts potential higher-order local organization among distinct RNA modifications. WattmaMod thus provides a scalable framework for high-resolution epitranscriptome profiling and expands RNA modification analysis beyond single-site prediction to coordinated multi-modification characterization.

17

Navigating the pangenome coordinate system with Shredtools

Shivakumar, V. S.; Langmead, B.

2026-07-08 bioinformatics 10.64898/2026.07.03.736354 medRxiv

Top 0.2%

9.7%

Existing notions of pangenome coordinates rely on hard-to-compute multiple sequence alignments. On the other hand, pangenome-wide exact unique matches (multi-MUMs) can be computed efficiently, and represent conserved stretches of columns in the underlying MSA. We introduce Shredtools, which uses multi-MUMs as pangenome waypoints and allows for sophisticated queries in pangenome coordinates. Its primary query is extract, which takes an interval of one sequence and extracts the smallest window containing it that is syntenic pangenome-wide. Shredtools' extract query can extract a gene region from 476 human genomes in half a second. Other queries help to refine these results, by finding local exact matches to improve the density of multi-MUM coverage ("enhance") and by selectively discarding sequences to improve the precision of the syntenic region ("zoom"). The Shredtools web interface (available at https://vikshiv.github.io/shredtools) allows for client-side handling of extract queries with index queries handled via simple and fast HTTP Range requests, simplifying usage and enabling pangenome-scale discoveries.

18

Benchmarking large language models for ACMG/AMP variant interpretation and variant calling

Corpas, M.

2026-07-05 genomics 10.64898/2026.06.30.735646 medRxiv

Top 0.2%

9.5%

Agentic large language models are increasingly used across the genomic workflow, from variant calling to clinical interpretation, yet they are evaluated by accuracy alone, a single figure that cannot say whether a system is safe or where in the workflow a failure originates. We present ClawBench, a framework that attributes each outcome to the architectural layer that produced it across both halves of the canonical pipeline. Two design choices remove the confounds that make agentic genomics hard to evaluate: a temporally blinded truth set, in which every scored ClinVar label first became available only after the training cutoff of every model tested, and a fail-closed evidence contract that blocks evidence circular with the truth label. We score validity, safety, provenance and reproducibility, not accuracy alone, under a constraint gradient that relocates correctness from a model's prior into executed, validated code. We show three things. First, dangerous misclassification is rare and model-invariant, a controlled precondition of the executed architecture rather than a frontier, while fabricated evidence is measurable and is neutralised by execution. Second, different variant classes are rate-limited by different layers: loss-of-function variants by the deterministic combiner threshold, and rare missense by evidence formation, where evidence acquisition is asymmetric and capped and strength assignment is a recoverable layer that naive strength-licensing prompts confound. Third, for variant calling the arms separate not on whether a model can plan a pipeline, which all do, but on trust properties, pinning, provenance, auditability and reproducibility, which climb monotonically toward validated execution; and a local open-weight model reproduces the safety result yet meets the structured-output and provenance contract far less often than frontier models, a conformance gap rather than a capability or safety gap. 
An end-to-end join attributes failures across the whole workflow, separating a missed call from a propagated genotype error from a correctly called but misinterpreted variant.

19

Cross-architecture ensembling of DNA foundation models improves the precision and stability of chimera detection in long-read metagenomic bins

MinSeo, K.; Jae-Ho, S.

2026-07-07 bioinformatics 10.64898/2026.07.02.735979 medRxiv

Top 0.2%

9.5%

Motivation: Chimeric metagenome-assembled genomes (MAGs) that pool DNA from multiple organisms contaminate downstream analyses. Marker-gene tools such as CheckM2 miss low-level chimerism, and DNA foundation models have been proposed as a sequence-composition alternative, but whether large autoregressive models (Evo2, 7B parameters) outperform smaller contrastive models (DNABERT-S, 117M) has not been rigorously tested.

20

Targeted epigenetic repression of oncogenic transcription factors via CRISPR/dCas9 locus-specific silencing

Taifour, S.; Wallis, C.; Wang, E.; Woodward, E.; Waryah, C.; Dymond, L.; Woo, A.; Houghton, P.; Iyer, K. S.; Norret, M.; Evans, C. W.; Winteringham, L.; Gaudieri, S.; Blancafort, P.

2026-06-27 genomics 10.64898/2026.06.27.734664 medRxiv

Top 0.2%

9.3%

Despite the revolutionary impact of genome engineering tools in medicine, the safe and effective intracellular delivery of CRISPR remains a major obstacle for clinical applications. Here, we implement precision molecular medicine and delivery strategies based on CRISPR/dCas9 systems adapted for epigenetic repression (dCas9-KRAB) to silence oncogenic drivers with high genomic selectivity. As proof-of-principle, we target the EWSR1-FLI1 translocation, which encodes a chimeric and hard-to-drug oncogenic transcription factor driving approximately 85% of the cases of Ewing Sarcoma (EWS), an aggressive malignancy affecting children and adolescents. We describe the development of a non-viral and programmable polymeric system for the delivery of dCas9-KRAB as ribonucleoprotein (RNP) payloads for selective EWSR1-FLI1 repression. We demonstrate highly efficient intracellular delivery of RNPs loaded in polyamide-amine (PAMAM) polymers functionalized by guanidino groups, resulting in robust silencing of EWSR1-FLI1 both in established cell line xenografts and in patient-derived xenografts (PDXs) of EWS. Moreover, silencing of EWSR1-FLI1 is accompanied by potent anti-tumor effects. To our knowledge, we describe the first non-viral platform for in vivo delivery of dCas9-KRAB/RNPs, which can be adapted for the repression of any oncogene. We further outline dCas9/RNP formulations for future therapeutic applications to treat poor-prognosis cancers driven by hard-to-drug oncogenes.